Optimal reconstruction of dynamical systems: A noise amplification approach 
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In this work we propose an objective function to guide the search for a state space reconstruction 
of a dynamical system from a time series of measurements. This statistics can be evaluated on any 
reconstructed attractor, thereby allowing a direct comparison among different approaches: (uniform 
or non-uniform) delay vectors, PCA, Legendre coordinates, etc. It can also be used to select the most 
appropriate parameters of a reconstruction strategy. In the case of delay coordinates this translates 
into finding the optimal delay time and embedding dimension from the absolute minimum of the 
advocated cost function. Its definition is based on theoretical arguments on noise amplification, the 
complexity of the reconstructed attractor and a direct measure of local stretch which constitutes a 
novel irrelevance measure. The proposed method is demonstrated on synthetic and experimental 
time series. 
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I. INTRODUCTION 
A. Background 

Nonlinear time series analysis has been actively de- 
veloped during the last decades following a fundamen- 
tal theorem by Takens pQ. He proved that it is pos- 
sible to reconstruct the attractor of a dynamical sys- 
tem using only a single time sequence of scalar mea- 
surements from the system under study. The vectors 
accomplishing this reconstruction have the form x(t) = 
(x(t),x(t — r), x(t — (m — 1)t)), where x(t) are our ob- 
servations on the system, the integer number m is known 
as the embedding dimension, and r is the time differ- 
ence between consecutive components, also called time 
lag or delay time, which is some multiple of the sampling 
time. Takens assumed an infinite sequence of noise-free 
measurements and proved the existence of a diffeomor- 
phism between the original and reconstructed attractors 
for almost any choice of positive delay times r and a suf- 
ficiently big dimension m. This approach is known as 
delay coordinate embedding and constitutes the first step 
of almost all nonlinear time series analysis methods such 
as the determination of attractor dimensions, Lyapunov 
exponents and entropy [2 . Takens' ideas were later re- 
visited by Sauer et al. [3] who proved that for compact 
attractors the embedding dimension m must be larger 
than twice the box-counting dimension of the original 
attractor to ensure a one-to-one reconstruction. In prac- 
tice, however, the box counting dimension of the original 
attractor is unknown and as a consequence the minimal 
embedding dimension must be derived from the data. 

The reconstruction problem starts with the measure- 
ment of appropriate physical quantities in order to ensure 
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further system state reconstruction. This is a very impor- 
tant instance in the reconstruction process which affects 
the rest of the process. The problem of selection of an 
optimal measurement function allowing an information 
flow from the unobserved variables to the observed vari- 
able was first discussed in [4] and later studied in [5H7]. 
This is a very interesting subject but out of the scope of 
this paper. 

The selection of the delay time r, although irrelevant in 
Takens' formal derivation, becomes important for experi- 
mental time series due to their finite length and measure- 
ment precision. More precisely, the quality of the recon- 
struction and the extraction of diffeomorphic invariants 
thereof are affected by the time window size t w = (m— l)r 
determined by the choice of r and m as argued in the lit- 
erature by many authors [4)[8HT6]. If a small (large) value 
of t w is chosen a phenomenon known as redundance (ir- 
relevance) deteriorates the reconstruction quality [4j [8] . 
We therefore see that, for delay coordinate reconstruc- 
tions, two parameters (say r and m or t w and m) need 
to be chosen according to some optimality criteria. In 
general, it can be expected that the optimal reconstruc- 
tion parameter values will be determined simultaneously. 

Many authors have proposed methods to select an opti- 
mal delay time by minimizing redundance between com- 
ponents [T3H22] . However, these methods require addi- 
tional argumentations on how to avoid irrelevance. In 
particular, Fraser and Swinney proposed, in their early 
work [17 , to use the first minimum of the Mutual In- 
formation (MI) between delayed components, and their 
method has now become the reference standard. By us- 
ing the first instead of the absolute minimum they at- 
tempted to bias the selection towards small delays, be- 
cause large ones could imply irrelevance between compo- 
nents. Such criterion has potentially three drawbacks: i) 
The first minimum could be a noise-induced fluctuation 
instead of a true local minimum [9J . ii) There is in general 
no evidence that, in addition to minimizing redundance, 
the ad-hoc choice of the first relative minimum of MI 
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FIG. 1. Profiles of the proposed cost function as a function of 
the dimension m and time window t w for the Mackey-Glass 
time series. The thick dashed line corresponds to the case of 
using all delay coordinates within the time window t w (i.e. 
using r — 1 and m — t w + 1, with r and t w expressed in 
units of sampling time). The cost function has an absolute 
minimum at m = 4, t w = 30. 



should as well minimize irrelevance, or at least keep it 
low. iii) Most chaotic flows have a characteristic oscillat- 
ing period which will modulate the MI profile generating 
a structure of maxima and minima. More precisely, the 
first maximum of MI will occur at this characteristic pe- 
riod, which will therefore act as an upper bound for the 
first minimum even if the irrelevance time of the time 
series is larger than this period. These three arguments 
are valid for all methods in [14-22 , where a redundance 
measure is proposed and a first minimum (or maximum) 
must be determined. 

On the other hand, methodologies based on dynami- 
cal arguments can be found in the literature on attractor 
reconstruction [22-25 which can simultaneously deter- 
mine both the delay time and the embedding dimension. 
These methodologies are based on tracking the dynam- 
ical evolution of nearest neighbors in the reconstructed 
space and measuring their divergence, and are intimately 
related to the concept of False Nearest Neighbors (FNN) 
introduced by Kennel et al. [26] . In this manuscript we 
will refer to them as dynamical methods. The idea of 
seeking false neighbors to determine the embedding di- 
mension was considered also in [27H29] , 

Recent years have seen an increasing interest on the 
development of reconstruction techniques for prediction 
purposes p~2j l30l 39 . These recent studies have fo- 
cused on producing non-uniform delay coordinate vectors 



whereby each component is associated to a different delay 
time. The problem of non-uniform delay reconstruction 
has also been addressed by Pecora et al. [40] and Garcia 
et al. [4TJ [42] . Both methods require the selection of the 
first minimum (or maximum) of their proposed measures 
and are only applicable to (non-uniform) delay coordi- 
nate reconstructions. Holstein and Kantz [39 proposed 
a generalized embedding approach for the case of time 
series modeling in a Markovian sense. 

The quality of a reconstruction was quantified by Cas- 
dagli et al. [4] in terms of its (observational) noise amplifi- 
cation effect when one wishes to estimate the state of the 
system. They denned the noise amplification a to locally 
measure this effect, a statistics which can potentially be 
computed from the observed time series. In principle, 
o~ allows for an absolute comparison among reconstruc- 
tions, which in turn enables a selection of optimal recon- 
struction parameters. An obstacle, however, is given by 
the hypothesis of a full knowledge on the true generating 
dynamics. In [4 the authors suggest to estimate a via 
a data-driven approximation of the dynamical evolution 
law. 

In this work we build on the idea of noise amplification 
of Casdagli et al. to define a new cost function which is 
fully computable from the time series and its reconstruc- 
tion. Embedding parameters are then determined from 
the absolute minimum of the proposed objective function, 
which can be calculated for any kind of reconstruction. 



B. Overview of this work 

In this paper we propose a criterion to select an optimal 
state-space reconstruction of the dynamics of a physical 
system from a time series of measurements performed 
on the system of interest. The presented methodology 
is based on the minimization of a cost function L which 
is readily computable from the available observational 
data. Different reconstructions, whether multivariate or 
time-delayed univariate, uniform (equally spaced) or not, 
can be directly compared through L, and thereby the 
suitability of different reconstruction settings can be as- 
sessed. 

For example, Fig. [T] illustrates how optimal param- 
eter values for the dimension m and time window t w 
can be simultaneously determined by a global optimiza- 
tion of the proposed cost function for the Mackey-Glass 
time series [43]. An important advantage of the pro- 
posed approach is given by its automatic and objective 
character, in contrast to, for example, the subjective or 
practitioner-dependent choice on the location of the first 
local minimum of Mutual Information (MI) , or the value 
of a threshold characterizing a negligible fraction of false 
nearest neighbors (FNN). 

The proposed cost function L is a local property of the 
reconstruction, which we then average over the attractor. 
In Fig. [2] we illustrate the local behavior of L by plotting 
a two-dimensional projection of the Mackey-Glass attrac- 
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FIG. 2. (Color online). Color-coded local cost function in two-dimensional projections of ra-dimensional delay coordinate 
reconstructions of the Mackey-Glass attractor. The time window t w — 30 is the same for all reconstructions but the dimension 
increases from left to right, (a) m — 2. (b) m — 3. (c) m = 4. 



tor [43] as we reconstruct it in spaces of increasing dimen- 
sion — from m = 2 (panel (a)) to m = 4 (panel (c)). As 
we can see in panel (a), L correctly senses regions of the 
attractor not yet unfolded for m = 2, i.e. regions where 
orbits with different dynamical evolutions overlap. These 
regions progressively vanish in higher-dimensional recon- 
structions, as reflected by lower values of L in panels (b) 
and (c). Notice that the local cost function serves as a 
complementary tool that enables an additional intuition 
for the problem: the global cost function difference be- 
tween the m — 3 and the optimal reconstruction (m = 4, 
Fig. [T]) gains significance in the light of Fig. [2^b) which 
shows localized regions of the attractor with large values 
of the local cost function. 

As an independent validation of the proposed recon- 
struction methodology we here consider forecasting per- 
formance for a range of horizons from short to long term. 
More precisely, we use the proposed approach to deter- 
mine a reconstruction space and compare the prediction 
accuracy of local linear models against the reconstruction 
arrived at following the standard approach (MI + FNN). 
The top panels of Fig. [3] contrast prediction errors at a 
fixed horizon as a function of the number of neighbors 
k involved in the prediction for Mackey-Glass, Rossler 
and Lorenz datasets (for a more detailed account of the 
datasets and simulation settings employed please refer to 
Appendix [A| . The lower panels, instead, illustrate the 
case of a fixed number of neighbors and a variable pre- 
diction horizon T (for ease of interpretation, measured 
in units of the characteristic — first recurrence — time of 
each system) . Figure [3] demonstrates that more accu- 
rate forecasting can be achieved at a range of horizons 
with the proposed approach. This follows from the fact 
that our methodology produces smooth embeddings for 
which the complexity of the prediction law is minimized 
and the dynamics can be more efficiently approximated, 
as discussed in the following sections. 

As we detail below, L is built upon the concept of noise 
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FIG. 3. (Color online). Comparison of prediction errors (nor- 
malized root mean squared error) of local linear models for 
reconstructions obtained following the standard approach (MI 
+ FNN, dashed line) versus the proposed methodology (full 
line) for Mackey-Glass (panels a and d), Rossler (6, e) and 
Lorenz (c, f). Top panels: forecasting error at a fixed hori- 
zon as a function of the database fraction (k/N) employed for 
model building. Bottom panels: prediction error at a range of 
horizons T from short to long for a fixed number of neighbors. 
The horizon T is measured in units of the first recurrence time. 



amplification introduced by Casdagli et al. [4 . However, 
in this manuscript we present a new interpretation of 
a in terms of the complexity of the prediction law de- 
fined on the reconstructed attractor, as advanced in the 
previous paragraph. We also show that the practical im- 
plementation of the proposed method is strongly related 
to: a) the FNN (False Nearest Neighbors) method pro- 
posed by Kennel et al. [26], and b) dynamical methods 
[22H25] . The cost function also incorporates a novel ir- 
relevance measure based on a direct computation of the 
reconstructed attractor local stretch. 
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In Section [TT] we introduce the notation we shall use 
throughout this work, classify the most common recon- 
struction strategies and discuss under which conditions 
a reconstruction is optimal. We review the definition 
of noise amplification in Sections III A We then show 
the equivalence between this definition and a measure 
of complexity of the resulting prediction law, and dis- 
cuss the consequences of this reinterpretation in Section 
Then, in Section |III C| we propose an estimation 
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method based on a first neighbors approximation which 
enables the computation of noise amplification from time 
series. In Section III D| we propose a formulation of the 
cost function which naturally incorporates a penalization 



term to account for irrelevant components. In Section IV 



we discuss how our approach is related to other methods 
in the literature. In Section [V| we present field data ap- 
plication examples. Finally, in Section VI we draw our 
conclusions. 
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FIG. 4. (Color online). Schematic representation of the mea- 
surement process and reconstruction of the attractor of a dy- 
namical system. This figure is inspired in Casdagli et al. [4]. 
See the main text for details. 



II. THE RECONSTRUCTION PROBLEM 

A. Introduction and notation 

In this work we will follow the notation employed in 
[4]. The time series x(t) is the sequence of measurements 
performed on the system under study at regular intervals 
St. We indicate with s(t) the state of system at time t, 
which evolves according to a deterministic dynamics on 
a d-dimensional manifold M: 

s(t) = f(s(0)), (1) 

where f l is the evolution law for a time step t. The 
time series is then the sequence x(t) = h(s(t)), where 
ft : M — >• R is a smooth measurement function defined on 
the original state space. The time series x(t) is the only 
observable, being s(t), /* and ft unobservables locked in 
a black box [4]. 

Figure [4] gives a schematic representation of the state 
space reconstruction process, starting from the measure- 
ment process. In its most general form, a reconstruc- 
tion of the state space vector s(t) can be described 
with an n-dimensional vector y(t) = \P(x(t)), where 
x(t) = (x(t),x(t — r),...,x(t — (m — l)r)) is the delay 
coordinate (DC) vector at time t, and & : R m — » R n 
is a further transformation that accounts for the possi- 
bility of considering a more general transformation (e.g. 
non-uniform delay coordinates, global or local SVD, or 
a noise reduction algorithm). Here we will first focus 
on DC reconstructions x(i), but the final methodology 
will be generally valid for arbitrary reconstructions of 
the form y(t) described above. 

The DC reconstruction defines a map <P : M — >> R m 
such that x = @(s). Under the hypotheses considered 
by Takens, <P will yield a diffeomorphism, and in such a 
case we will obtain an embedding of the attractor in the 
reconstructed space. The dynamics is given by a function 



F t fully determined by the original dynamical law /* and 
<2>: 

x(t) = F*(z(0)) = <Pofo ^(^(O)). (2) 

The value of x(t+T) will be given by the reconstructed 
vector at time t, and can be obtained by applying the evo- 
lution operator F T to x(t) and retaining its first compo- 
nent. This operation can be captured by the function 

g T = *-F T (3) 

where 7r is the column vector (1, 0, . . . , 0), in such a way 
that x(t + T) = g T (x(t)). The same as F , the function 
g T is completely determined by ^, the original evolution 
law / T , and the measurement function ft, being 

g T = hof T o$-\ (4) 

In the following we will use the simplified notation x(T) 
instead of x(t + T) to denote the value of the time series 
T time steps after t, which will be implicit in the no- 
tation. Accordingly, x and s will refer to x(t) and s(t) 
respectively. 

B. Reconstruction strategies 

Delay coordinate (DC) reconstruction is the most com- 
mon strategy for attractor reconstruction. However, sev- 
eral alternatives exist ranging from derivative coordinates 
[45] or global principal value decomposition [46] to non- 
uniform DC vectors. The latter possibility has been ex- 
tensively explored in the literature in recent years [T2]l30l- 
142] 144] . All of these approaches can be described in terms 
of Fig. [4] i.e. by means of a transformation ^ applied to 
the DC vector. Indeed, any coordinates defining a re- 
construction y(t) from samples in the interval [t — t w ,t] 
can be thought of as the result of applying a transfor- 
mation ^ to the full delay coordinate vector (fDC) which 



5 



contains all available data from [t — t w ,t] (i.e. r = 1 and 
m = t w + l, with r and t w expressed in units of sampling 
time). 

Back on the domain of linear transformations, Gibson 
et al. [8 found an analytical solution for PCA in the 
limit of a small time window width. The 'small window 
regime' refers to widths smaller than 

T* w = 2y/3(x*) /((dx/dt)*) (5) 

where the time series values x have been normalized to 
zero mean. Their analytical solution has principal direc- 
tions given by discrete Legendre polynomials. For this 
case, W is a linear transformation which projects the fDC 
vector from the time window t w onto n directions given 
by the first n discrete Legendre polynomials. Therefore, 
the free parameters to build the Legendre coordinates are 
the time window width t w and the dimension n. Gibson 
et al. gave a guidance for choosing t w , but they argued 
that there is no simple rule to select n. In order to avoid 
redundance and irrelevance and achieve a good balance 
between signal-to-noise ratio and complexity, they pro- 
posed to use a window width smaller than but close to 
[8]. However, there is no demonstrated relationship 
between and the characteristic irrelevance time, and 
it can not be discarded that time window widths larger 
than could provide useful information about the sys- 
tem state for its reconstruction. 

Gibson et al. also showed that, in the limit of a small 
DC dimension m, applying the discrete Legendre poly- 
nomials transformation to the fDC vector is equivalent 
to a finite differencing filter which will recover derivative 
coordinates. Therefore, derivative coordinates are also 
encompassed by the same linear transformation frame- 
work. 

Finally, non-uniform delay coordinate (nuDC) recon- 
struction has been proposed as a generalization of stan- 
dard DC reconstruction by allowing a different time lag 
for each component of the DC vector, namely 

y(t) = 00), x(t - ri), x(t - r 2 ), x(t - T( n _i))), (6) 

where the dimension n and the n — 1 delay times 
{ti,T2, T( n _i)} are the set of parameters to be deter- 
mined. Also in this case the reconstructed vector y can 
be obtained from the fDC vector on t w = max^r^). The 
dimension is reduced from m = t w + 1 to n by retaining 
only the n coordinates corresponding to the non-uniform 
delays. It is important to remark that this procedure con- 
stitutes a linear projection \P onto a subspace and that 
the iDC reconstruction is not a new procedure outside 
the reconstruction scheme of Fig. [4] proposed in [4] . 

In summary, existing strategies in the literature consist 
of a linear projection of the fDC vector onto a subspace 
(in particular, this is also the case for the usual, uniform 
DC reconstruction). It is unclear which strategy will be 
optimal in each case. 



C. When is the reconstruction optimal? 

Once the reconstruction problem has been defined, the 
question arises as how to choose the free parameters t w 
and m (and, eventually, also the parameters associated 
to a further transformation ^) in order to achieve the 
best possible reconstruction. To answer this question one 
must define the criterion of optimality. In other words: 
what would best mean for the purpose of the reconstruc- 
tion? Takens' theorem gives no guidance to define such 
criterion, being almost any r and a big enough m equally 
valid solutions for the noise free and infinite amount of 
data case analyzed by the theorem. The presence of noise 
and the finite amount of data introduce limits to the qual- 
ity of the reconstruction in the sense that any measure 
we would later estimate from it would require modeling 
the distribution of points, introducing noise amplifica- 
tion and estimation error [4]. The aim of the optimal 
reconstruction will be to simultaneously minimize these 
two effects. Observational noise amplification is reduced 
for well unfolded attractors in the reconstructed space 
because, roughly speaking, well separated states will be 
harder to mix by noise displacements. On the other hand, 
excessive unfolding will at some point produce an overly 
complicated reconstruction that will later require more 
data points in order to be modeled. The compromise be- 
tween these two opposite scenarios will then depend on 
the noise level and the amount of available data. 

Another idea of optimality is based on minimizing the 
distortion of the original attractor introduced when ap- 
plying the reconstruction map If we assume that the 
original attractor is known, then it is possible to define 
measures of distortion between the two attractors as in 
[4j [TTJ [47]. However, there is no reason to assume that 
the original attractor constitutes the best representation 
of itself. For example, Pecora et al. suggest that a com- 
bination of original and delay coordinates of the Lorenz 
attractor can produce a better reconstruction than the 
original attractor itself [40]. Beyond these arguments, we 
will not use these distortion measures as our cost func- 
tion because we will assume that the original attractor is 
unknown. However, the point we would like to question 
here is whether these measures [4j [Til Hi] are absolute 
measures of the reconstruction quality. 

III. CONSTRUCTION OF THE COST 
FUNCTION 

A. Noise amplification 

The concept of noise amplification as given by Casdagli 
et al. [4 aims at quantifying the effect that observational 
noise on x has on our uncertainty about the system state 
s. Given that the state of the system is unknown and we 
only have access to the observational time series x = h(s), 
it is impossible to evaluate the quality of a reconstruc- 
tion by comparing the reconstructed attractor with the 
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original one. However, the quality of a reconstruction 
can be assessed by considering the predictive power that 
it allows for. In this context, the definition of noise am- 
plification (see [4] for details) is given by: 

a(T,x) = lim a e (T,x) (7) 

e— >-0 

where 

cJe (T, x) = ± ^Var(x(T)\B e (x)) (8) 

being Var(x(T)\B e (x)) the conditional variance of x(T) 
for x in a radius e ball B e (x) defined by an observational 
noise level e. According to the definition given in [4], 
Var(x(T)\B e (x)) does not contain any contribution from 
modeling error. Instead, the exact form of g T is assumed 
to be known and used to compute the width of the image 
of B e (x) when passed through g T . In the limit e — ^ the 
value of a(T, x) is independent from e but only a func- 
tion of the reconstruction as given by Finally, in [4] 
the authors considered the mean square value of <r(T, x) 
with respect to the measure of the attractor, (a 2 (T)), to 
globally characterize the reconstruction. 

The forward time step T on which Var(x(T)\B e (x)) 
is evaluated is a free parameter of a(T, x). Notice, how- 
ever, that the evaluation of a(T, x) on a single value of 
T is insufficient to correctly characterize the divergence 
of reconstructed orbits since the measurement function 
can collapse different states onto the same value. A more 
robust measure of the divergence between neighboring 
orbits is obtained by considering the average of cr 2 (T, x) 
for T on the interval [0,2m] for some upper value Tm- 

Therefore, we redefine a e (x) as 

i pTm 

° 2 M = 7f^ * 2 AT,x)dT, (9) 

J-M Jo 

and consider the limit 

<j(x) = lim <J e (x) (10) 

instead of the original definitions given in eqs. ([7| and ([8| 
respectively. We will keep the notation cr e (T, x) (with an 
explicit indication of parameter T) for the cases where 
the average over T in [0, Tm] is not performed and when 
we would like to evaluate this quantity — and others de- 
rived from it — at a single instance in future, T. 

The choice of parameter Tm will determine the sensi- 
tivity of <j(x) to the quality of the reconstruction. How- 
ever, there is no need to make an accurate selection of 
Tm provided that it is large enough as to capture or- 
bits divergence. The purpose of setting an upper bound 
for Tm is just to achieve a better sensitivity of a to the 
optimum reconstruction. Our method based on the es- 
timation and minimization of cr e (x) produces consistent 
results for a wide range of Tm values, as we shall show 
in Section HVCl 



B. Complexity of the prediction law 

Although the purpose of the definition of cr(T) is to 
characterize the reconstruction determined by it de- 
pends exclusively on the function g T . Therefore, it 
should be possible to derive an expression of cr(T) in 
terms of g T only. This is the purpose of this section. 

Let B e (x) be a Gaussian ball with standard deviation 
e, i.e. a multivariate normal distribution with a diagonal 
covariance matrix E with all entries equal to e 2 . In this 
context, Var(x(T)\B e (x)) is the variance of this Gaussian 
distribution mapped through g T to R. As we aim to take 
the limit e — » 0, we can consider e small enough as to 
make a first order approximation around x 

9 T (S + 0=9 T (x) + b^ + O(U\\ 2 ) (11) 

where b = Vg T (x) and £ represents a displacement from 
x with a size ||£|| bounded by e. In this linear limit the 
resulting mapped distribution is also normal with a vari- 
ance given by b^Ttb = e 2 ||V^ T (x)|| 2 . According to this 
result we find that 

a\T,x) = \\Vg T {x)\\ 2 - (12) 

This relationship, not considered in [4], allows for a new 
interpretation of the minimization of (<r 2 (T)) in terms 
of the complexity of the resulting g T . A change in the 
reconstruction modifies the support where g T lives and 
therefore g T itself. The smoothness of g T can be quanti- 
fied by (||Vg T (x)|| 2 ), which can then be interpreted as a 
measure of the complexity of this function. Minimizing 
(<j 2 (T)) is therefore equivalent to minimizing the com- 
plexity of g T . 

This new interpretation is per se sufficient to consider 
the minimization of (<r 2 (T)) as an optimality criterion 
to guide the reconstruction process, independently of the 
original arguments on the effects of observational noise. 
The law g T will be easier to model the lower the value of 
(<r 2 (T)) is, and therefore fewer parameters will be needed 
to describe it. Furthermore, smoothness of the dynam- 
ics is the first assumption when modeling a system with 
the purpose of forecasting — e.g. with techniques such as 
artificial neural networks, radial basis functions, or inter- 
polating splines. Building on this hypothesis, all of these 
modeling approaches include a regularization parameter 
to avoid overfitting [48]. More importantly, even when 
forecasting is not the final purpose but estimating an in- 
variant quantity such as the maximum Lyapunov expo- 
nent, these algorithms are also implicitly applied, being 
fc-NN the most frequently used. Therefore, it is crucial 
that the unknown function g T complies, as closely as pos- 
sible, with the hypotheses made on it when it is modeled. 

A further consequence of this reinterpretation is the 
possibility to generalize the definition of a to the more 
general reconstruction vector y = \P(x). In [4 the e-ball 
B e (x) is originated in i.i.d. observational noise in each 
component of the DC vector x and can only be associ- 
ated to it. Any further transformation will distort this 
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noise-induced e-ball. For the new interpretation in terms 
of ||Vg T (x)|| 2 , the e-ball is a mathematical construction 
to capture the behavior of the neighborhood of and as 
such is therefore also applicable to y. 



C. Estimation of noise amplification with fc-NN 

In practical applications the law g T is inaccessible (be- 
ing unknown both the dynamical law j l and the mea- 
surement function h\ and the only available information 
is the time series itself. In this context we propose to 
estimate cr 2 (x) by recursing to the nearest fc neighbors of 
x [49]. These fc neighbors and x define a set ti^{x) with 
fc + 1 elements which will act as a proxy for the ball B e (x) 
in the following definitions. 

We approximate the conditional variance 
Var(x(T)\B e (x)) by 



E 2 (T,x) 



fc + 1 



x' EUk (x) 



[x'(T)-u k (T,x)} 2 (13) 



where x'(T) is the future value of x corresponding to x' \ 
and 



Uk 



(14) 



x'EUk (x) 



In order to capture the time average over T in [0, Tm] 
in Eq. ([9|, we define E^(x) (without explicit T notation) 

as 



= lit 



(15) 



where the integral has been replaced for a sum over the 
actual p sampled times Tj in [0, Tm]. 

We estimate the size of the neighborhood as 



40) 



=//|2 



fc(fc + l) 



(16) 



x' ,x" EUk(x) 



which is a robust measure of the characteristic square 
radius of Uk(x). Notice that here we are making no as- 
sumptions on the box counting dimension of this set. 

Finally, the noise amplification estimated from k near- 
est neighbors is [50] 



(17) 



and we consider the average over N reference points Xi of 
the reconstructed attractor randomly selected from the 
time series to define the global measure 



1 N 



(18) 




FIG. 5. Schematic representation of two delay coordinate 
(DC) reconstructions, (a) ^i and (b) ^2, with n < T2. The 
distance between neighbors increases with the delay time r 
due to the stretching and folding of the attractor, while the 
global characteristic size L of the attractor remains constant 
because it only depends on the amplitude of the time series. 



For this fc-NN estimation of (a 2 ) a new free parameter 
has been introduced to the method, namely the number 
of neighbors fc. If the value of fc is chosen too large, the 
linear approximation of g T around x will be invalid and 



cr 2 (x) will differ from the 



e — » 



limit a 2 (x). On the 



other hand, fc needs to be large enough for the conver- 
gence of the estimator of the conditional variance (Eq. 
13). In Section IV A we will present practical consider- 



ations concerning the selection of appropriate values for 
fc. 



D. Normalization 



In the framework of the new interpret ation o f cr 2 (x) in 
terms of ||Vg T (x)|| 2 discussed in Section IIIB 1 the value 
of e is no longer related to the observational noise and 
therefore its scale can change with the reconstruction. 
For example, were y = \P{x) the optimal reconstruction, 
a simple rescaling ay would also be optimal because it 
would leave the neighborhood structure unchanged and 
therefore would not affect the computation of any dy- 
namical invariant nor the application of any prediction 
algorithm. However, in the calculation of the charac- 
teristic radii e& would be affected by the factor a, and 
therefore the resulting will also differ by a factor a. 
This is clearly undesirable: equally optimal reconstruc- 
tions should yield the same values. 

The hypothetical case described above is somewhat ar- 
tificial in the sense that it cannot occur when reconstruct- 
ing a dynamical system: no such global scaling factor a 
will unexpectedly affect interpoint distances for a par- 
ticular reconstruction among the set of all possible DC 
reconstructions which is obtained by varying parameters 
r and m. However, a subtle effect will be present upon 
variation of these parameters: the value of Ck(x) along 
the attractor will grow with larger delays due to the ir- 
relevance effect which stretches (and folds) the attractor. 

Figure [5] illustrates this behavior schematically. Each 
panel depicts a DC reconstruction given by ^1 and ^2, 
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FIG. 6. (Color online). Response of the characteristic radii 
6k (k = 2) to changes in the reconstruction parameters t w 
and m for the Lorenz time series (noise free and with 10% 
noise), (a) e k monotonically grows with t w and m in the 
noise free case, (b) The same holds true for the noisy case, 
(c) For the normalized radii el = Ck/y/rn the dependence 
with m is eliminated at the lower end of t w values, (d) In the 
noisy case -as well as for large t w in the noise-free case- the 
sampled points fill the ra-dimensional space and the growth of 
€k with m cannot be avoided. In all cases the thick dashed line 
corresponds to the full delay coordinate (£DC) reconstruction 
(t = 1 and m = t w + 1). 



where t\ < t<l. The attractor, represented by a curve, 
is sampled the same number of times in both cases. In 
this example the larger value of T2 induces a stretching 
and folding of the reconstructed attractor. However, the 
characteristic size L of the reconstructed attractors is the 
same as it is determined by the amplitude of the time 
series. On the other hand, the typical first neighbor dis- 
tance is larger for ^2, an effect that is captured by larger 
values of (e k ). However, larger values of (e^) produce 
in turn a smaller value of cr&, which goes in an opposite 
direction to a desired penalization of irrelevance. 

In Fig. [6] we use the time series from the x variable of 
the Lorenz system (see Appendix |P| for details) to profile 
the behavior of e k defined by 



N 



N ^ 



(19) 



as a function of the reconstruction parameters m and t w . 
As argued above, this figure quantifies how the character- 
istic distance between first neighbors averaged over the 
attractor grows with the size of the considered time win- 
dow. Additionally, the typical distance between neigh- 
bors will also grow with the dimension induced by the 
noise which populates all directions as pointed out in [25] . 
We therefore conclude that in order to be able to com- 
pare different reconstructions a normalization is needed 
to account for this changing average interpoint distance. 
We also notice that e k captures the local scale variations 



between reconstructions and is a measure of the degree of 
irrelevance of the considered delayed components. Tak- 
ing this argument further, the normalization we propose 
in this section follows from considering the average of 



o\ (x) over the attractor (Eq. Il8j) as a weighted average 



of E£(x) with weights Wk(x) proportional to e k (x) 
w k (x) = a k e k (x)~ 2 . 



i.e. 



(20) 



Across different reconstructions these weights should sat- 
isfy X^Wfc(£i) = 1, therefore the normalization factor is 



(21) 



in such a way that ot k <J k = w k {xi)E^{xi). 

This normalization gives the product a k cr k the units 
of x, the observed variable, independently of the pro- 
posed reconstruction. This will allow a direct compari- 
son between any kind of reconstruction for a given time 
series. If the statistics is so normalized, a k cr k will be up- 
per bounded by the standard deviation of the time series 
and lower bounded by the noise level. 



E. Overview 

Here we collect the arguments of the previous sections 
to construct a cost function L to guide the search of 
the optimal reconstruction. Ideally, such function can 
be thought of as a sum of two terms with competing be- 
havior 



L = R + XL 



(22) 



The R term should penalize redundance when the win- 
dow size t w is too small or, in case it is not, when the 
number of components within t w is insufficient to unfold 
the attractor. On the other hand, the / term should pe- 
nalize irrelevance when the window size t w is too large 
or, in case it is not, when the number of components 
within t w is unnecessarily larger than needed to unfold 
the attractor. 

Throughout this section we have shown the basis and 
definition of a k inspired on the definition of noise am- 
plification in [4], from which we arrived at a new in- 
terpretation in terms of the complexity of the dynamics 
g T . Furthermore, we have defined a normalization fac- 
tor a k needed to adjust a k to scale variations induced by 
stretching and folding of the attractor when irrelevant de- 
layed components enter the reconstruction vector. These 
two terms, a k and a^, are natural candidates to play the 
role of R and /. We therefore define 



and arrive to 



Rk = log 10 cr k 
h = logio a k 

L k = Rk + h 
= log 10 (a k a k ) 



(23) 
(24) 

(25) 
(26) 
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where the parameter A in Eq. (22) must necessarily be 
equal to 1, following t he con ception of a& as a normal- 
ization factor (Section HID). 



In Appendix |E| we give details on our implementation 
of the method proposed in this work, which is freely avail- 
able. 



IV. RELATED WORK 



A. Selection of k and relationship with the method 
of FNN of Kennel et al. 



If we only consider the first neighbor to compute (x) , 
i.e. k = 1, and choose the time step T equal to r reducing 
the sum in Eq. ( 15 ) to a single term, we exactly recover 
the statistics proposed by Kennel et al. [26] to detect false 
nearest neighbors, that is 



(71 (T = t,x) 



\x(t + T)-x'(t + T)\ 

d(x, x r ) 



(27) 



where x' is the first neighbor of x in the reconstruction 
and d(x, x') their distance and we question whether it is 
a true neighbor. The criterion used by Kennel et al. was 
to consider x and x' false neighbors if <J\{T = r, x) > 10. 

The method of Kennel et al. is therefore encompassed 
by our proposal as a particular case with k = 1. The 
main difference is given by the fact that we consider T 
in the interval [0,7m] instead of T = r. The time lag 
parameter r, or equivalently t w , must also be determined. 
Figure [7] shows the profiles of vs. t w for several values 
of k and benchmark time series. Choosing k = 1 is not 
always the best option because the obtained profiles can 
be too noisy for a correct determination of the optimal 
t w . Increasing the value of k regularizes the estimation 
of Var(x(T)\B e (x)) but also increases the size of Uk(x). 
The corresponding profiles of tend to be smoother 
or converge to a stable profile but the method becomes 
less sensitive to local divergences. A trade-off solution is 
given by the smallest value of k for which the profile 
is stable in the sense of consistency in the optimal t w 
values obtained, which are given by the position of the 
global minima of In general, we found k = 2 or 3 
to be a good choice for the time series considered in this 
paper, and suggest the use of these values for general 
applications. 



B. Relationship with prediction error 



Here we analyze the mean of E%(T,x) (from Eq. (13)) 
over the attractor instead of the proposed cost function. 
This is nothing else than the usual mean squared pre- 
diction error (MSE), which we will here compute using a 
local constant model based on the first k neighbors and 
will note E\ (T) . 



(a) Lorenz 




■ (c) Mackey— Glass 
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FIG. 7. (Color online). Profiles of Lj~ vs. t w for a range of k 
values and benchmark case studies, (a) Lorenz, m — 3. (b) 
Rossler, m = 3. (c) Mackey-Glass, m = 4. (d) Mackey-Glass 
with 10% noise, m — 4. For all time series studied in this 
work the structure of maxima and minima converges for low 
values of k. 





FIG. 8. Schematic representation of two reconstructions of 
the same region of an attractor with (a) well behaved orbits 
and (b) collapsed ones. The computation of Ek(T) using k = 
2 will yield identical results for both reconstructions while Lk 
will penalize the collapsed orbits of panel (b). 



The question we address in this Section is whether a 
minimization of E^iT) constitutes an appropriate crite- 
rion to select reconstruction parameter values instead of 
the more complex definition of L^{T). The main differ- 
ence between these two approaches is that Ek (T, x) does 
not carry any information about the size of the neigh- 
borhood U e (x). The relevance of this lack of information 
becomes evident when considering a reconstruction with 
a region of collapsed orbits. By 'collapsed orbits' we here 
refer to the case of true neighbors that become arbitrar- 
ily close for a given reconstruction — without involving 
the presence of false neighbors. This is illustrated in Fig. 
[8] and again on panel (b) of Fig. [9] Panels (a) and (b) 
show two different reconstructions of the same group of 
points; as we can see in panel (b) the orbits are collapsed. 
If we compute Ek(T) using k = 2 we arrive at identical 
values for both reconstructions. This follows from the 
fact that the distortion occurring in panel (b) does not 
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FIG. 9. Illustration of the sensitivity of Lk and Ek to an orbit 
collapse for the 3-dimensional delay coordinate (DC) recon- 
struction of the Rossler system, (a) Plot of Lk and Ek vs. r. 
Notice that at r = 7 the curves differ maximally, (b) Detail 
of the reconstructed attractor at r = 7. The reconstruction 
exhibits collapsed orbits at this specific delay value. 



alter the neighborhood structure. On the other hand, if 
we compute &k (T, x) for both reconstructions we will ar- 
rive at different values. The case of panel (b) will yield 
a larger value of (T, x) . This will follow from smaller 
neighborhood sizes in the denominator for regions of the 
reconstruction that exhibit a collapse. 

The situation of an orbit collapse described in this Sec- 
tion is typical of the 'redundance limit' (small t w ), but 
can also occur for intermediate t w values in low dimen- 
sional reconstructions. This effect can be illustrated for 
the three dimensional DC reconstruction of the Rossler 
system from the time series of the x variable (see Ap- 
pendix O). For r = 7 there is an orbit collapse as shown 
in Fig. ]9[b). Figure [9| a) shows the response of Ek as 
compared to for this three dimensional reconstruction 
as a function of r. Notice how the profile of Lk jumps at 
r = 7 while Ek does not. 

To support the contention that the response of 
shown in Fig. [9ja) corresponds to the collapsed region 
of Fig. §b) we use a|a|(T, x) as a local cost function 
to extract spatial information about the reconstruction 



quality. This has been done in Fig. [10] where the values 
of a^cr^(T, x) have been plotted on the attractor in the 
lower panels and as a function of x(t) (= xq) in the up- 
per panel. From left to right the reconstructions have 
r = 2, r = 5, r = 7 and r = 11 respectively. For r = 7 
(panel (c)) the values of Lk are clearly dominated by the 
behavior in the collapsed region. The same holds true 
for r = 11 (panel (d)) but in this case there are false 
neighbors as orbits cross each other [54] . 

with [IT] where a|<r|(T, x) has 



If we compare Fig. 
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been replaced by E^(T^x), we find that the local response 
to the collapse of orbits (r = 7) and to the presence of 
false neighbors (r = 11) is lower, especially as compared 
to the ground response. We note in passing that this 
ground response is in itself an interesting finding from 
this plot. More precisely, E%(T,x) exhibits a rich struc- 
ture along the attractor (e.g. along the radial direction 
on the spiral) even when the reconstruction is optimal 
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FIG. 10. (Color online). Sensitivity of Lk to collapsed orbits. 
The local cost function a|<7fc(T, x) is plotted vs. xo in the up- 
per panels and color-coded in the two dimensional projection 
of the reconstruction in the lower panels for reconstruction 
parameters m — 3 and (a) r = 2, (b) r = 5, (c) r = 7 and 
(d)r = ll. 
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FIG. 11. (Color online). Same as Fig.[Io|for E%(T,x). This 
quantity is less sensitive to a collapse of orbits (panel (c)) and 
false neighbors (panel (d)) as compared to its ground level, 
which presents a spurious structure (panels (a) and (b)). See 
the main text for details. 



(r = 5, panel (b)). E%(T,x) penalizes orbits of the at- 
tractor that do not seem to be more divergent than oth- 
ers where however E%(T,x) is smaller. Instead, these or- 
bits belong to lower density regions, which yields larger 
values of E%(T,x). As expected, E%(T,x) evaluates the 
quality of the prediction algorithm (k-NN in this case) 
rather than reconstruction quality. On the other hand, 
the variability of a^cr^(T, x) across the attractor shown 
in the lower panel of Fig. [l0[b) is less severe and only 
associated to intrinsic divergence of orbits. 

In Small et al [12] the authors explored the use of 
the fc-NN prediction algorithm for the selection of opti- 
mal reconstruction parameters. The quantity they pro- 
posed to minimize is the description length (DL) of the 
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time series using this particular modeling approach and a 
candidate reconstruction. In the particular case of fc-NN 
the description length reduces to the logarithm of Ek (T) ; 
therefore, the previous critiques in this Section apply to 
their method. Another important drawback is that their 
analysis of the method is reduced to k = 1 and T = 1. 
The authors also referred to previous works based on 
minimizing DL for different prediction algorithms (radial 
basis functions [51] and neural networks [52]). They con- 
cluded that for these algorithms DL is harder to estimate, 
and that their free parameters are harder to optimize for 
the purpose of determining reconstruction parameters. 



C. Relationship with dynamical methods 



100 



The dynamical methods [22-25 mentioned in the In- 
troduction are also based on the detection of FNNs and 
provide a criterion for choosing m and r. The method- 
ology we propose in this work can also be considered a 
dynamical method. In the following we will discuss the 
most relevant differences between these methods and our 
approach. 



1. Gao and Zheng 

In [24 Gao and Zheng used a quantity close to <ti(T, x) 
(notice k = 1) to determine the reconstruction parame- 
ters m and r, and also to estimate the maximum Lya- 
punov exponent of the dynamics. This quantity is given 

by 



R(T, x) = 



d(x(T),x'(T)) 



(28) 



where the only difference with o~\ (T, x) is that both nu- 
merator and denominator are given by distances com- 
puted in the reconstructed space. They then defined the 
cost functions to be minimized as A = (ln(R(T,x))) and 
A + = (ln(R(T,x))) + , where in A + the mean is taken 
over positive values of ln(R(T,x)) only. Mean values of 
logarithmic divergence are suboptimal for the purpose of 
detecting collapsed orbits. However, the main critique 
we make to the method of Gao and Zheng is the use of a 
distance in the full reconstructed space in the numerator 
to measure divergence of neighboring orbits. Given a DC 
vector x — (x(t — (m — l)r), ...x(t — r),x(t)) at time t, 
its image under the dynamics for a time step T, x(T), 
will share m — 1 components with x when T = r. This 
constraint, which is inherent to the DC reconstruction 
and independent of the time series under analysis, will in 
general affect the distance d(x{T), x'(T)) in the numera- 
tor of Eq. ( 28 ) producing an artificial minimum at r = T 
in a A vs. r profile. 

In Fig. [12] we plot the time window t w which mini- 
mizes L/c as a function of parameter Tm for the DC re- 
constructions of the noise-free Mackey- Glass time series 
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FIG. 12. (Color online). Optimal window size t w (in an Lk 
sense) as a function of the horizon parameter Tm for (a) the 
noise-free Mackey-Glass case and (b) the noisy case. Panels 
(a) and (b) also show t w vs. T as obtained by applying Gao's 
method (dashed gray line) which exhibits a high dependence 
on parameter T. The dashed vertical lines signal the hori- 
zon parameter Tm equal to the first maximum of the upper 
curve of the space-time separation plot [53] for the time series 
without any reconstruction (panel (c)). 



(panel (a)) and also for the noisy case (adding 10% i.i.d. 
noise, panel (b)). For increasing dimensions m from 2 
to 4 we see that the time window converges to a stable 
value if Tm is large enough. In contrast, for small val- 
ues of Tm the value of t w is unstable. This is due to the 
high correlation between successive values of oversampled 
time series which implies that x(T) will be determined by 
x independently of the reconstruction quality. In order 
to characterize a proper interval [0, Tm] for T we built 
a space-time separation plot [53] for the time series un- 
der consideration without any reconstruction as shown 
in Fig. [l2|c). The upper curve corresponds to the 95% 
percentile of the distribution of \x(t + T) — x(t)\, i.e. this 
curve is almost an upper bound for \x(t + T) — x(t)\. 
We use this curve to choose Tm as the time of its first 
maximum in order to ensure that nearby orbits are given 
enough time to diverge. For this time series we obtain 
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Tm = 44, a value that avoids the fluctuations in t w shown 
in panels (a) and (b). Notice however that any value of 
Tm between 30 and 100 is equally valid, and that no 
particular choice of this parameter is critical in order to 
obtain consistent results. The use of the 95% percentile 
curve from the space-time separation plot has given ap- 
propriate Tm values that avoid fluctuations in t w for all 
time series considered in this paper (plots of t w vs. Tm 
not shown). 

In [24] the Authors give no suggestion to choose an 
adequate value for T. Indeed, no value of T ensures an 
independent output for r as discussed previously in this 
Section. As shown in Fig. [l2^a) the t w obtained with 
Gao's method has a strongly fluctuating dependency on 
T. This situation even deteriorates when noisy time se- 
ries are considered (panel (b)), in which case one identi- 
cally obtains r = T as previously explained. 



the value of T should be the same for all reconstructions 
independently of the candidate delay time r. This leads 
to wrong conclusions about the optimal delay time, be- 
cause the smaller are r and T, the lower is the expected 
divergence and therefore fewer false neighbors will be de- 
tected using a fixed threshold. The authors correctly 
diagnosed this problem and proposed to apply a linear 
transformation to the reconstructed space before using 
the false strands algorithm. The purpose is to spread 
out the attractor which, in the case of a small r, tends 
to be collapsed onto the identity line. We found this 
solution unsatisfactory because it requires to transform 
the reconstruction we intend to evaluate. Secondly, the 
suppression of the denominator in Eq. (27) prevents the 
detection of collapsed orbits as discussed in Section [IV B| 



4 . Summary 



2. Buzug and Pfister 

Close to the definition of A by Gao and Zheng is the 
averaged local deformation proposed by Buzug and Pfis- 
ter in an earlier work [22 . Their divergence measure is 
obtained similarly to A but with a small difference in the 
expression of Eq. (28): the distance from the reference 



point x to its first neighbor x' is replaced by the distance 
to the center of mass of the cloud of neighbors initially in- 
side a fixed-radius-ball around the reference point (both 
in the numerator and denominator). Buzug and Pfister 
give no justification for the latter choice, which has two 
drawbacks: first, as recognized by the Authors, in noisy 
conditions the center of mass tends to be close to the ref- 



erence point, introducing a divergence in Eq. (28). Sec- 



ond, the divergence of false neighbors is underestimated. 
More precisely, if the cloud is split when the dynamics 
evolves (assuming that there were false neighbors), the 
distance of x to the center of mass is about half of the 
distance between the two clouds. This effect makes it 
harder to discriminate false from true neighbors. 



3. Kennel and Abarbanel 

In 2002 Kennel and Abarbanel [25] presented an im- 
proved version of their false nearest neighbor method [26 . 
This new method, which mainly deals with the case of 
oversampled time series, also produces an estimate of the 
optimal delay time. The strategy is based on consider- 
ing nearest strands, which are sets of nearest neighbors 
which are close in time and characterize nearest orbits 
rather than nearest neighbors. The divergence of near- 
est neighbors is then averaged over neighbors in a strand 
and compared to a threshold. The divergence measure is 
similar to Eq. (27), also with T = r but the denominator 



is absent (it is replaced by the standard deviation of the 
time series for normalization purposes). The proposed 
methodology has two undesirable consequences. First, 



In summary, common drawbacks of existing dynamical 
methods in the literature are the following: i) Results in 
general depend on T as in [23] . Usually the value of T 
is either fixed to be equal to r as in [23] [25] , or equal 
to the sampling time as in [22 , which is suboptimal and 
sampling dependent, ii) The global reconstruction qual- 
ity measure is an average of the logarithm of a divergence 
metric which does not fully capture the presence of false 
neighbors as in [22 - 24 or is a count of divergences above 
a threshold which is fixed ad hoc as in [25] . iii) The num- 
ber of neighbors is fixed to k = 1 as in [12] I23H25] while 
in some cases it is desirable to consider k > 1 to gain 
robustness. 

However, the main contribution in our proposal is not 
how to deal with these drawbacks but to introduce the 
normalization factor which penalizes irrelevance by 
adjusting the scales in the reconstructed attractor. We 
have not found in the literature any irrelevance measure 
based on the characteristic distance among neighbors. 
This measure explicitly uses the fact that irrelevance 
stretches (and folds) the attractor. Furthermore, this ir- 
relevance measure is incorporated into the cost function 
as a normalization factor avoiding the introduction of ex- 
tra parameters. The factor is the key ingredient in our 
reconstruction quality measure — it is responsible for an 
absolute minimum in our cost function when screening 
all possible values of t w . 



V. PRACTICAL APPLICATIONS 

A. Noisy time series 

For the case of noisy time series we found, in general, 
that the higher the number of delayed components con- 
sidered inside the selected time window, the lower the 
value of I//- is. An illustration of this behavior is given 
in Fig. [l3^a) where we show the profiles of vs. t w for 
the Mackey- Glass time series with 10% (amplitude) i.i.d. 
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dnoisy produces a correct rank of nearest neighbors. This 
result was anticipated in [4] from an information point of 
view: more information implies less distortion, therefore 
the distortion is a monotonic, nonincreasing function of 
the dimension. Our cost function is reflecting this be- 
havior because it is based on the theoretical definition of 
noise amplification cr(T, x) as defined in [4]. 



B. Discrete Legendre polynomials 
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FIG. 13. (Color online). Cost function vs. time window size 
t w for the Mackey-Glass series plus 10% noise, (a) Delay 
coordinate (DC) reconstructions with different dimensions m 
as in Figure [I] (b) Legendre coordinates reconstruction of 
different dimensions n. Both panels show the curve for the 
fDC reconstruction (r = 1 and m = t w + 1). Note that the 
Legendre profiles reach values below the latter curve. The 
vertical dashed line indicates t w = r* . 



observational noise. For this time series the minimal em- 
bedding dimension is m = 4 in the noise-free case (see 
Fig. [I]). However, in the noisy case the profiles of Lk vs. 
t w do not reach their minimum for a small dimension m, 
and the minimum of Lk corresponds to the case of includ- 
ing every possible delayed value inside the time window 
of the reconstructed vector. We previously called this 
high dimensional vector the full delay coordinate (fDC) 
vector for a given time window, and it has a dimension 
m = t w + l (where t w is expressed in sampling time units) 
which depends on the sampling. The behavior of Lk can 
be explained by noticing that the inclusion of redundant 
components allows noise filtering (given that the noise 
is i.i.d.). Noise filtering is performed in the computation 
of the Euclidean distances between pairs of reconstructed 
vectors since the Euclidean distance is essentially an aver- 
age of the quadratic distance for each component. More 
precisely, for large m the distance d^ oisy between two 
noisy DC vectors tends to be [55 



d 2 ~ d 2 

^noisy 00 ^clean 



■2m4 2 



(29) 



where d 2 lean indicates distance between clean vectors and 
£ 2 is the noise variance. As the second term does not de- 
pend on the specific pair of vectors (assuming i.i.d. noise), 



In the previous sections we have only considered delay 
coordinate reconstructions, and found that high dimen- 
sional reconstructions yield minimum noise amplification 
in the case of noisy time series. One can apply a further 
transformation ^ : R m — >> R n in order to reduce the di- 
mensionality of the delay coordinate vector. To this end 
we can use a linear ^ which projects a fDC vector from 
t w + 1 dimensions onto n directions given by, for exam- 
ple, the first n discrete Legendre polynomials. Therefore, 
the free parameters for the Legendre coordinates that we 
need to determine are the time window width t w and the 
dimension n. 

Our aim is to show here how our cost function allows 
the evaluation of more general reconstructions than just 
delay coordinates. Notice that the proposed method can 
be used to determine whether applying a further trans- 
formation ^ improves the reconstruction quality. In ad- 
dition, it allows the computation of optimal parameters 
for this reconstruction exactly in the same way as for 
delay coordinate reconstructions. 

Figure I3 {b) shows Lk vs. t w for n = 1, . . . , 6 for the 



the same noisy Mackey-Glass time series as in panel (a). 
We see that the minimal embedding dimension n = 4 
is recovered despite the presence of noise. However, the 
optimal time window width t w = 113 is much larger than 
in the noise- free case. This difference can be explained 
by considering that, due to the presence of noise, the 
information about the system state carried by measures 
with larger delay times is now more relevant relative to 
the larger uncertainty on the system state (reflected by 
larger values of Lk with respect to the noise free case). 



As pointed out in Section [HC] we expect that the optimal 
embedding parameters depend on the noise level. 

Finally, we would like to notice that this transforma- 
tion \P not only reduces the dimensionality from m = 
t w + 1 to n = 4 but also the value of the cost function 
with respect to the lowest level attained with a DC re- 
construction (which is obtained for the fDC vector and 
plotted in both panels of Fig. 13 with a gray dashed line). 



C. Chua's circuit 

We now consider a real time series from a practical 
implementation of Chua's circuit. The data correspond 
to measurements of inductor current values performed in 
[56] and can be retrieved from [57] . The data are depicted 
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FIG. 14. Chua's circuit data [56]. (a) The full length time 
series, (b) Detail of the first 1500 data points. The character- 
istic period is approximately 94 sampling time units, (c) Pre- 
liminary 2-dimensional delay coordinate reconstruction using 
r = 23 (which corresponds to a quarter of the characteristic 
period). 



in Fig. 14 This time series has a significant amount of 
observational noise mainly due to a small resolution in 
the A/D conversion and to the Hall-effect probe used to 
measure the current through the inductor [56 . 

For this noisy time series we explored both delay and 
Legendre coordinates. Figure 15 'a) shows the profile of 
Lfe vs. t w for DC reconstructions of increasing dimension 
m from 2 to t w + 1. These results are consistent with 
the ones obtained for the noisy Mackey- Glass case (Fig. 
13 ^a)), where the minimum value of is reached when 
the DC vector incorporates all delayed values inside the 
selected time window. 

In Fig. 15 'b) we show the curves vs. t w correspond- 
ing to Legendre coordinates for increasing dimensions n. 
For comparison we also plot the profile corresponding to 
the £DC reconstruction (plotted with a gray dashed line 
as in panel (a)). Again, as observed for Legendre coordi- 
nate reconstructions of the noisy Mackey- Glass time se- 
ries, the absolute minimum of is now reached for a low 
dimensional reconstruction with parameters n = 3 and 
t w = 81. This type of reconstruction significantly reduces 
the minimum value of Lk obtained with a DC reconstruc- 
tion, which is in turn due to the noise reduction effect of 
projecting onto the discrete Legendre polynomials. 

Now we compute from Eq. ([5| in order to find the 
upper bound for which the analytical solutions are equiv- 
alent to PCA. Simply applying Eq. ([5| to the raw data 
leads in general to an overestimation of ((dx/dt) 2 }. A 
solution is to smooth the data with a Savitzky-Golay fil- 
ter [58] . This type of filter has two free parameters: the 
length of the fitting window and the order of the fitted 
polynomial. Using a fitting window of 7 points and poly- 
nomials of order 2 we found r* ~ 96. The time window 
width obtained by minimizing Lk is less than but close 
to the upper bound t*, in agreement with the guidelines 
given in [8 to choose t w in order to maximize the sig- 
nal to noise ratio and simultaneously avoid irrelevance 
effects. Furthermore, t w < r* implies that the Legen- 
dre coordinate reconstruction which is optimal in terms 
of Lk is likely performing a PCA over the £DC recon- 
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FIG. 15. (Color online). Cost function vs. time window size 
t w for Chua's circuit time series. Panel (a) corresponds to 
delay coordinate (DC) reconstructions with different dimen- 
sions m as in Figure [I] Panel (b) corresponds to Legendre 
coordinates of different dimensions n. Both panels show the 
profile for the fDC reconstruction (r = 1 and m = t w + 1). 
The vertical dotted line indicates the value of r* 



struction. This is confirmed by Fig. [16] where from three 
independent views of the reconstruction we see that the 
Legendre coordinates are aligned with the principal di- 
rections of the attractor. 



D. SFI Laser 

The last data set we consider in this work corresponds 
to measurements of fluctuations in a far-infrared laser 
taken from the Santa Fe Institute time series prediction 
competition [60] and can be retrieved from [61] . The full 

panel (a), and further 
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time series is plotted in Fig. 
details on different time scales in panels (b) and (c). The 
average pulsation frequency is approx. 1.65 MHz and the 
sampling time is 80 ns. Therefore, the time series is sam- 
pled approximately 7.6 times per cycle (see panel (c) of 
Fig. 17), which is a very low sampling frequency as com- 
pared to the previous examples. Notice that the time 
series covers a larger numbers of cycles than in previ- 
ous examples. This in turn produces a higher number 
of neighbors outside the Theiler exclusion window and 
therefore more points are effectively available to compute 
the cost function. However, some problems may arise due 
to undersampling: i) inside the embedding window we 
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FIG. 16. (Color online). Views of the 3-dimensional Legendre 
coordinate reconstruction of Chua's circuit attractor. (a) y± 
vs. yo, (b) ?/2 vs. yo and (c) y2 vs. y±. (d) Perspective view 
of the reconstructed attractor. The local cost function over 
the attractor is color-coded. The principal directions of the 
attractor (in the PCA sense) are aligned with the coordinate 
axes. The local cost function does not present localized re- 
gions of high values (as e.g. in Fig.[2jb)) implying the absence 
of false neighbors. 



may not find enough observations to unfold the attrac- 
tor, ii) in the case of noncoherent dynamics, the sampling 
could be insufficient to describe the faster oscillations as 
discussed in [62] , and iii) the optimal value of t w can not 
be studied with the desired resolution. From the descrip- 
tion and analysis of the time series given in [60] we can 
in this case eliminate possibilities i) and ii) above. 

Figure 18 'a) shows the values of Lk for DC reconstruc- 
tions as a function of t w for different dimensions m. The 
absolute minimum of Lk occurs for t w = 16 and m = 17, 



i.e. considering all delayed values inside the selected win- 
dow (fDC). 

As discussed in Sec. |V A[ high dimensional DC recon- 
structions allow noise filtering when computing distances 
between vectors and this is probably the reason why the 
absolute minimum is found at m = 17. From [l8^a) it 
can be argued that a reconstruction of dimension m = 5 
can be achieved with only a slight increase in the cost 
function. To investigate whether this increase is relevant 
for reconstruction quality we considered the distribution 
of the local cost function in the same way as it was done 
in Fig. [5] We found (figure not shown) that this slight 
increase in is due to large increases of the local cost 
function at localized regions of the attractor which in- 
volve a small fraction of points (as it can also be observed 
in Fig. [2jlo) ) . This gives a meaning to the difference be- 
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FIG. 17. SFI Laser data, (a) Full-length time series, (b) 
Detail of the first 1000 samples, (c) Zoom on the first 100 
points. The characteristic period is approximately 7 sampling 
time units. 



tween m = 5 and m = 17 in the global cost function 
which can be then considered relevant. As far as the 
time window width is concerned, we found that it ap- 
proximately corresponds to the size of two characteristic 
periods. We therefore see that, in contrast to previously 
considered systems, our analysis suggests that for this 
time series the irrelevance time is larger than the charac- 
teristic period. 



In Fig. 18 'b) we plot for each dimension m the min- 
imum value of Lk over t w for the DC reconstructions 
of panel (a), and also for Legendre coordinates. In the 
latter case the minimum of Lk is reached at t w = 19 
and n = 20, which means that no dimension reduction is 
achieved. 

To illustrate the versatility of the proposed approach, 
we also considered PCA coordinates in order to compare 
with the performance of the Legendre approach. More 
precisely, we applied PCA to the t w = 16 fDC recon- 
struction, which was the optimal DC reconstruction, and 
computed Lk for reconstructions of increasing dimension 
by sequentially incorporating the PCA components in de- 
creasing order of their corresponding eigenvalue (full line 
profile in Fig. [l8^b)). After the 7th PCA component Lk 
falls below the Legendre profile and reaches a minimum, 
thereby substantially reducing the dimensionality of the 
representation. 



Figure [19] shows a 3-dimensional reconstruction using 
coordinates 2/0? V\ an d Uq obtained from PCA. The recon- 
structed attractor exhibits a structure reminiscent of the 
Rossler attractor: a growing spiral in the (2/0? plane 
with a reinjection of orbits at different radii. Indeed, 
the equations describing the system dynamics are equiv- 
alent to Lorenz equations but the symmetric two-spiral 
structure of the Lorenz attractor is collapsed into one 
single spiral by the measurement process [60]. The val- 
ues of the local cost function have been computed for the 
n=7 dimensional reconstruction and color-coded in the 
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FIG. 18. (Color online). Evaluation of the cost function 
for different reconstruction strategies applied to the SFI laser 
time series, (a) Lj~ vs. window size t w for ra-dimensional DC 
reconstructions and for the fDC reconstruction (thick dashed 
gray line), (b) As a function of the dimension m, we plot i) 
Lk for fDC (thick dashed gray line), ii) min^Lfc for DC, hi) 
mint^Lfc for Legendre coordinate reconstructions, and iv) Lk 
for PCA reconstructions. In the latter case, principal com- 
ponents are incorporated in decreasing order of their corre- 
sponding eigenvalues (full line). 



3-dimensional projection of Fig. 19 Notice the costly re- 



gion corresponding to the reinjection of orbits into the 
spiral. In contrast, in flatter regions of the spiral orbits 
are more predictable and show lower local objective func- 
tion values. 



E. Non-uniform delay coordinate reconstructions 

We now report an exploration of the potential of this 
approach for the construction of non-uniform delay co- 
ordinate (nuDC) vectors, i.e. the case where consecutive 
delayed coordinates are not equidistant. We briefly re- 
port a case study taken from Pecora et a/., who in [40] 
introduced an iDC reconstruction method and used it to 
analyze a quasiperiodic, multiple time-scale time series of 
the x-coordinate of a orbit in a two-dimensional torus liv- 
ing in a 3-dimensional space. The two frequencies are Ui 
and U2 with U2 = 2.5ttuji and the time series is sampled 
32 times per fast cycle. 

On one hand, applying a greedy search algorithm, Pec- 
ora et al arrived at a 4-dimensional reconstruction with 
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FIG. 19. (Color online). Views of a 3-dimensional reconstruc- 
tion of the SFI laser attractor using PCA coordinates . (a) y\ 
vs. ?/o, (b) ye vs. yo and (c) y§ vs. y±. (d) View in perspective 
of the reconstructed attractor. The local cost function for 
the 7-dimensional PCA reconstruction is color-coded over the 
attractor and penalizes regions of high divergence. A spline 
interpolation was used to connect points in order to guide the 
eye for this undersampled time series. 



delay times t\ = 8, r2 = 67, and T3 = 75. The first 
delay t\ = 8 captures the fast frequency, being exactly 
1/4 of the corresponding period (this fraction is the time 
lag where autocorrelation vanishes for harmonic signals) . 
However, T2 = 67 slightly fails to capture the slow fre- 
quency (it should be T2 = 63). 

On the other hand, we exhaustively searched over the 
complete space of parameters {m, ti, T2, T( m _i)} for 
the minimum value of According to our proposed 
measure the optimal iDC reconstruction is attained 
for parameter values m = 4, t\ = 8, T2 = 63, and T3 = 71. 
As we can see, in this solution t\ and T2 exactly capture 
the fast and slow frequencies of this torus time series. 

Finally, we used independent sets of random samples 
and ran bootstrapping experiments to compare the value 
of Lk on the above reconstructions. First, we found that 
the iDC reconstruction obtained by the approach here 
proposed is significantly better, in a statistical sense, 
than the one found by Pecora et al. This result was 
expected by construction (except possibly the statistical 
significance). Secondly, we also found that according to 
our measure the iDC reconstruction obtained by Pecora 
et al. is in turn statistically significantly better than the 
best possible uniform DC reconstruction. 
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VI. CONCLUSIONS 

In this work we considered the reconstruction prob- 
lem as an optimization case, and proposed an objective 
function to guide the search for an optimal state space 
reconstruction. This cost function is based on 1) the 
hypothesis of an underlying deterministic dynamics, 2) 
theoretical arguments on noise amplification, and 3) the 
idea of minimizing the complexity of the reconstruction. 
It incorporates a novel irrelevance measure based on the 
characteristic distance to nearest neighbors in the recon- 
structed space. The latter statistics captures in a simple 
and intuitive way attractor stretching — a typical feature 
of overfolded reconstructions. 

The proposed objective function can be evaluated on 
any reconstructed attractor, thereby enabling a direct 
comparison among different approaches: (uniform or 
non-uniform) delay vectors, PCA, Legendre coordinates, 
etc. It can also be used to select the most appropri- 
ate parameters of a particular reconstruction strategy by 
searching for the absolute minimum of the advocated cost 
function. For example, in the case of delay coordinates 
the search for the optimal delay time and embedding di- 
mension can be automated by simply exploring the cor- 
responding parameter space. The absolute character of 
this search is in contrast with subjective choices of other 
methods in the literature, such as the value of a thresh- 
old to define false neighbors, or a tolerance limit to dis- 
criminate a noise-induced fluctuation from a true relative 
minimum. 

Our approach has only two free parameters: the num- 
ber of nearest neighbors k and the prediction horizon Tm- 
We have given a simple guidance to choose appropriate 
ranges for these parameters, where results depend mildly 
on the particular configuration and the method returns a 
robust output. Code implementing the proposed method 
is freely available (see Appendix [e|. 

We applied the proposed method for the analysis of 
several synthetic and experimental times series. Among 
the latter we considered field measurements from an ex- 
perimental realization of Chua's circuit and a far-infrared 
laser taken from the Santa Fe Institute time series pre- 
diction competition. In particular, we used the latter 
examples to demonstrate the ability of the proposed ap- 
proach to handle different types of reconstructions, which 
we believe to be a distinctive and powerful feature of this 
method. In all cases we found a well defined absolute 
minimum of the objective function. 

In the particular case of delay coordinate embeddings 
we found the interesting result that the time span of the 
optimal reconstruction window is not necessarily related 
to the characteristic period of the time series under con- 
sideration. The results obtained for the SFI laser time 
series suggest that measurements from more than one cy- 
cle in the past are relevant for the reconstruction of the 
system state. 

As a final remark, we would like to notice that in the 
present work we have not proposed a new reconstruction 



strategy but a new methodology to measure the quality 
of a reconstruction. From the case studies analyzed in 
this work we conclude that none of the considered recon- 
struction techniques (delay coordinates, PCA, Legendre 
coordinates) is universally optimal. From a practitioner's 
point of view, the proposed cost function is therefore use- 
ful to assess which is the most appropriate reconstruction 
method for the particular time series under study. 
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Appendix A: Forecasting accuracy comparison 

As an objective validation of the proposed approach we 
compared the forecasting accuracy of local linear mod- 
els in reconstruction spaces obtained with the standard 
approach (Mutual Information + False Nearest Neigh- 
bors) and our method. For the Mackey- Glass time se- 
ries the standard embedding method gives m = 4 and 
r = 22 (an embedding window of size t w = 66) while 
the proposed method gives m = 4 and r = 10 (i.e., a 
smaller window size t w = 30). For the ^-coordinate of 
the Rossler time series the standard embedding space 
is m = 3, r = 11 (t w = 22) while our approach gives 
m = 3, r = 5 (t w = 10). Finally, for the x-coordinate 
of the Lorenz system the standard method yields m = 3, 
r = 16 (t w = 32) and the proposed one m = 3, r = 6 
(t w = 12). 

In these reconstruction spaces, for a given vector y(t) 
in an independent test set (not used to determine the re- 
construction parameters) the prediction algorithm iden- 
tifies the first k neighbors among the N available in the 
training set (also used to determine the reconstruction 
parameters) and locally fits a linear model. 

A comparison of the normalized root mean squared 
prediction error on data sets from the Mackey-Glass, 
Rossler, and Lorenz systems (Appendices |Bj[C] and |D] re- 
spectively) is depicted in Fig. [3j In the upper panels we 
show the prediction error as a function of the number of 
neighbors k (more precisely, as a function of the fraction 
k/N where N is the size of the training set). We fixed the 
horizon T at the first maximum of the space-time sepa- 

In the lower 



ration plot as discussed in Section IV C 1 



panels of Fig. [3] we show the prediction error as a function 
of the horizon T (expressed in units of the characteris- 
tic period of each time series) for a fixed, non-optimized, 
arbitrarily chosen number of neighbors k = 15. 
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Appendix B: Mackey- Glass dataset 



Appendix D: Lorenz dataset 



We considered the Mackey- Glass equation 



dx 
~dt 



ax(t — r) 
l + x c (t-r) 



bx 



with parameters a = 0.2, b = 0.1, c = 10, and r = 17. 
Using as initial conditions x(t < 0) = and x(t = 0) = 
1.2, after integrating this infinite-dimensional differential 
equation and sampling every St = 0.5 we kept a 10,000- 
element time series following the first 2,000 iterations 
which were discarded as a transient. We also computed 
a 50,000-element continuation used only as a test set in 
the prediction task, and the same was done for Rossler 
and Lorenz below. 



The Lorenz system [63] is described by the equations 

dx/dt = a(y — x), 
dy/dt = x(p -z)-y, 
dz I dt = xy — j3z 

with parameters a = 10, p = 28, and j3 = 8/3. We 
initialized the system at# = 2/ = l,z = 50 and in- 
tegrated with a 4th order Runge-Kutta procedure with 
step St = 0.01. After a transient of 1,000 data points we 
kept 10,000 samples from the ^-coordinate of this flow, 
and additional 50,000 for independent testing. 



Appendix E: Implementation 



Appendix C: Rossler dataset 

For the Rossler system [64] we integrated the equations 

dx/dt = —(y + z), 
dy/dt = x + ay, 
dz/dt = P + z{x-i) 

with parameters a = 0.15, j3 = 0.2, and 7 = 10. As ini- 
tial conditions we used x = y = z = 1 : we then employed 
a 4th order Runge-Kutta numerical integration method, 
sampled the ^-coordinate of the obtained trajectory with 
a step St = 0.125, and finally discarded the first 8,000 
data points keeping a time series of length 10,000, plus 
an extra 50,000 for testing. 



The computation of the cost function requires to 
perform nearest neighbor searches in high dimensional 
spaces. To this end we use a box- assisted algorithm for 
efficient neighbor searching [2]. This algorithm is an ex- 
tension of the False Nearest Neighbor (FNN) algorithm 
of the TISEAN package [59]. This extension allows the 
search of k nearest neighbors of x to define the set Uk (x) 
where none of the neighbors belong to the same Theiler 
window. 

An implementation in C code of the method proposed 
in this work can be downloaded from [65]. The program 
computes the global and local cost function for DC, fDC 
and Legendre coordinates, which are internally imple- 
mented. Alternatively, it also allows loading reconstruc- 
tions from external files, and then computes correspond- 
ing local or global cost function values. We also provide 
scripts to reproduce Figs. [Tj |2| [l5^b) and 16 
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